{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cf5dba9a-d530-4a0a-ae71-2d741f7e705f",
   "metadata": {},
   "source": [
    "This notebook allows calculating the values for `b` (the number of bands) and `r` (the number of minhashes in a band) used in the fuzzy dedup algorithm. The default values are `b=14` and `r=8`, as defined in the [FineWeb datasets paper](https://arxiv.org/pdf/2406.17557). The x-axis of the graph represents the Jaccard similarity between a pair of documents, while the y-axis represents the probability that they become duplication candidates. Please refer to http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf for more details on this methodology."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "800bc113-8b5e-4cec-8717-98fa05753bd0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Define the parameterized function\n",
    "def f(s, r, b):\n",
    "    return 1 - (1 - s**r)**b\n",
    "\n",
    "# Set the parameters r and b\n",
    "r = 8\n",
    "b = 14\n",
    "\n",
    "# Generate values for s in a range, e.g., from 0 to 1\n",
    "s_values = np.linspace(0, 1, 500)  # 500 points between 0 and 1\n",
    "f_values = f(s_values, r, b)\n",
    "\n",
    "# Plot the function\n",
    "plt.figure(figsize=(8, 6))\n",
    "plt.plot(s_values, f_values, label=fr\"$f(s) = 1 - (1 - s^{{{r}}})^{{{b}}}$\", color='blue')\n",
    "plt.xlabel(\"s\")\n",
    "plt.ylabel(\"f(s)\")\n",
    "plt.title(f\"Plot of the function $f(s) = 1 - (1 - s^{{{r}}})^{{{b}}}$\")\n",
    "plt.legend()\n",
    "plt.grid(True)\n",
    "plt.show()\n"
   ]
  },
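  {
   "cell_type": "markdown",
   "id": "lsh-threshold-note",
   "metadata": {},
   "source": [
    "The next cell is a minimal sketch that puts numbers on the curve plotted above. It assumes the rule of thumb from the Mining of Massive Datasets chapter that the curve rises most steeply near $(1/b)^{1/r}$, and it inverts `1 - (1 - s**r)**b` to find the similarity at which a pair becomes a candidate with a chosen probability. The helper names `candidate_probability`, `approx_threshold`, and `similarity_at_probability` are illustrative and not part of any library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "lsh-threshold-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative helpers (hypothetical names), assuming b bands of r minhashes each.\n",
    "\n",
    "def candidate_probability(s, r, b):\n",
    "    # Probability that a pair with Jaccard similarity s matches in at least one band.\n",
    "    return 1 - (1 - s**r) ** b\n",
    "\n",
    "def approx_threshold(r, b):\n",
    "    # Rule-of-thumb similarity around which the curve rises most steeply.\n",
    "    return (1 / b) ** (1 / r)\n",
    "\n",
    "def similarity_at_probability(p, r, b):\n",
    "    # Invert candidate_probability: the similarity s at which the probability equals p.\n",
    "    return (1 - (1 - p) ** (1 / b)) ** (1 / r)\n",
    "\n",
    "# Example with the default parameters; change these to compare other settings.\n",
    "rows_per_band, num_bands = 8, 14\n",
    "print(f\"approximate threshold: {approx_threshold(rows_per_band, num_bands):.3f}\")\n",
    "print(f\"similarity with 50% candidate probability: {similarity_at_probability(0.5, rows_per_band, num_bands):.3f}\")\n",
    "for s in (0.5, 0.7, 0.8, 0.9):\n",
    "    print(f\"s={s:.1f} -> P(candidate)={candidate_probability(s, rows_per_band, num_bands):.3f}\")"
   ]
  },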
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98016b04-b6a0-465d-b65b-6d402978c9f0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
